Efficient data processing to identify information and reformat data files, and applications thereof

ABSTRACT

The present disclosure is directed to systems and methods for identifying demographic information in a data file. The method may include: receiving the data file containing a plurality of fields of demographic information from a third party, the data file having inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information; analyzing the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information, the machine learning model being based on a plurality of machine learning algorithms to identify different types of demographic information; generating a score indicating a probability that each of the plurality of fields of demographic information was identified correctly; and generating a revised data file labeling each of the plurality of fields of demographic information based on the identified type.

BACKGROUND

Field

This field is generally related to processing information.

Background

As technology advances, an ever increasing amount of demographic information is becoming digitized. For example, for healthcare providers, demographic information may include, but is not limited to, their name, address, specialties, academic credentials, certifications, and the like. This demographic information may be available from various public data sources, such as websites. These websites may retrieve the demographic information from underlying databases, such as state, county, city, or municipality databases, that store the data. For example, states may have licensing boards that maintain lists of all licensed healthcare providers, along with their associated demographic information. In another example, health insurance companies may have public websites listing the healthcare providers, and associated demographic information, in their network. In another example, healthcare providers may themselves set up public websites that list such demographic information about their practices.

Entities may have a need to maintain demographic information. For example, health insurance companies may have a need to maintain demographic information about healthcare providers that need to be reimbursed for claimed services. To maintain the demographic information, these entities often attempt to collect and integrate the demographic information from providers, hospitals, group practices, or the like. Oftentimes, requests for this information have poor response rates, and the responses that are received may be poorly formatted and may include inaccurate information. For example, the responses may be structured in an unknown format, may include inconsistent or mislabeled headings, or may include spurious information. As such, the responses should be reviewed to verify the contents of the data provided and reformatted into a consistent structure. However, the responses frequently include hundreds, if not thousands, of entries with any number of different types of demographic data. Consequently, manually reviewing and reformatting data from these responses may be difficult, time-consuming, and expensive, and often takes weeks per file to complete. These costs and time delays significantly contribute to the administrative overhead costs that account for about one third of healthcare premiums in the United States.

Thus, systems and methods are needed to improve reviewing and reformatting these responses into a validated format by automating expensive administrative tasks, thereby eliminating manual data formatting and reducing wasteful spending.

BRIEF SUMMARY

In an embodiment, the present disclosure is directed to a method for identifying demographic information in a data file. The method may include receiving the data file containing a plurality of fields of demographic information from a third party. The data file may include inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information. The method may also include analyzing the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information. The machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information. The method may further include generating a score indicating a probability that each of the plurality of fields of demographic information was identified correctly. The method may also include generating a revised data file labeling each of the plurality of fields of demographic information based on the identified type.

System and computer program product embodiments are also disclosed.

Further embodiments, features, and advantages of the invention, as well as the structure and operation of the various embodiments, are described in detail below with reference to accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated herein and form part of the specification, illustrate the present disclosure and, together with the description, further serve to explain the principles of the disclosure and to enable a person skilled in the relevant art to make and use the disclosure.

FIG. 1 illustrates a diagram of a network for communications between one or more data sources and a system, according to aspects of the present disclosure.

FIG. 2 illustrates a diagram of a system for reviewing and reformatting data files from the one or more data sources, according to aspects of the present disclosure.

FIGS. 3-5B illustrate example data files received from the one or more data sources, according to aspects of the present disclosure.

FIG. 6 illustrates an example revised data file, according to aspects of the present disclosure.

FIG. 7 illustrates a method of reformatting data from a data source, according to aspects of the present disclosure.

FIG. 8 is an example computer system useful for implementing various embodiments.

The drawing in which an element first appears is typically indicated by the leftmost digit or digits in the corresponding reference number. In the drawings, like reference numbers may indicate identical or functionally similar elements.

DETAILED DESCRIPTION

Embodiments provide ways to review and reformat data files that include inconsistent or mislabeled nomenclatures for one or more fields of a plurality of fields of demographic information or spurious demographic information, which would require weeks per file to review and reformat manually. For example, embodiments may analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information. The machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information. For example, analyzing the data file may be based on a combination of one or more of semantic content of the demographic information, a shape of the demographic information, or metadata. In this way, embodiments provide the ability to identify different types of demographic data. Embodiments may also generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly. Embodiments may also generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type. For example, the revised data file may be formatted based on the requirements of the third party that provided the original data file. In other words, the revised data file may be fully customizable based on individual requests for the restructured data. Thus, embodiments provide the ability to effectively and efficiently generate data files in a format that is most useful to the third party.

Furthermore, the present disclosure may implement a combination of a plurality of machine learning algorithms and rules, which improves the functionality of the computing device. Namely, the combination of machine learning algorithms and rules avoids overtraining, and thus overcomplicating, the machine learning model, thereby reducing the amount of resources, e.g., processing consumption and memory resources, required to generate reformatted data files. Additionally, in some aspects, the present disclosure may intelligently identify different types of demographic information based on a sampled portion of the data file, rather than the entire data file, which may include hundreds, if not thousands, of entries. By identifying the different types of demographic information based on a sampled portion, the present disclosure may further reduce the amount of resources required to generate reformatted data files.

In the detailed description that follows, references to “one embodiment”, “an embodiment”, “an example embodiment”, etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

FIG. 1 is a diagram illustrating a network 100 for communications over a network 110 between one or more data sources 105 and a system 115. In some embodiments, the one or more data sources 105 may be any data source that maintains databases of demographic information of one or more individuals, such as healthcare providers, including but not limited to, doctors, dentists, physician assistants, nurse practitioners, nurses, or the like. Although the present disclosure describes the individuals as being healthcare providers, it should be understood by those of ordinary skill in the arts that the present disclosure may be implemented to accumulate data from any data source. In some embodiments, the data sources 105 may be hosted on a server, such as a host server, a web server, an application server, etc., a data center device, or a similar device, capable of communicating via the network 110.

In some instances, the one or more data sources 105 may include a Centers for Medicare & Medicaid Services (CMS) data source, a directory data source, a Drug Enforcement Administration (DEA) data source, a public data source, a National Provider Identifier (NPI) data source, a registration data source, and/or a claims data source. The CMS data source may be a data service provided by a government agency. The database may be distributed, and different agencies or organizations may be responsible for different data stored in the CMS data source. The CMS data source may also include data on healthcare providers, such as lawfully available demographic information and claims information. The CMS data source may also allow a provider to enroll and update its information in the Medicare Provider Enrollment System and to register and assist in the Medicare and Medicaid Electronic Health Records (EHR) Incentive Programs.

The directory data source may be a directory of healthcare providers. In one example, the directory data source may be a proprietary directory that matches healthcare providers with demographic and behavioral attributes that a particular client believes to be true. The directory data source may, for example, belong to an insurance company or a health system, and can only be accessed and utilized securely with the company's consent.

The DEA data source may be a registration database maintained by a government agency such as the DEA. The DEA may maintain a database of healthcare providers, including physicians, optometrists, pharmacists, dentists, or veterinarians, who are allowed to prescribe or dispense medication. The DEA data source may match a healthcare provider with a DEA number. In addition, the DEA data source may include demographic information about healthcare providers.

The public data source may be a public data source, perhaps a web-based data source such as an online review system. These data sources may include demographic information about healthcare providers, area of specialty, and behavioral information such as crowd sourced reviews.

The NPI data source may be a data source matching a healthcare provider to an NPI. The NPI is a Health Insurance Portability and Accountability Act (HIPAA) Administrative Simplification Standard. The NPI is a unique identification number for covered health care providers. Covered health care providers and all health plans and health care clearinghouses must use the NPIs in the administrative and financial transactions adopted under HIPAA. The NPI is a 10-position, intelligence-free numeric identifier (10-digit number). This means that the numbers do not carry other information about healthcare providers, such as the state in which they live or their medical specialty. The NPI data source may also include demographic information about a healthcare provider.

The registration data source may include state licensing information. For example, a healthcare provider, such as a physician, may need to register with a state licensing board. The state licensing board may provide the registration data source information about the healthcare provider, such as demographic information and areas of specialty, including board certifications.

The claims data source may be a data source with insurance claims information. Like the directory data source, the claims data source may be a proprietary database. Insurance claims may specify information necessary for insurance reimbursement. For example, claims information may include information on the healthcare provider, the services performed, and perhaps the amount claimed. The services performed may be described using a standardized code system, such as ICD-9. The information on the healthcare provider could include demographic information.

The one or more data sources 105 may receive data files from any number of origins, e.g., multiple practice groups, other ones of the plurality of data sources 105, etc. For example, the one or more data sources 105 may receive responses to requests for demographic information from, for example, medical practice groups, hospitals, or the like. This information may be entered by an administrator, and as such, the data file may include inconsistent or mislabeled nomenclatures for one or more fields of a plurality of fields of demographic information or it may include spurious demographic information. As another example, the one or more data sources 105 may acquire another entity that utilizes different nomenclatures for one or more fields of the plurality of fields. In some implementations, one or more of the plurality of data sources 105 may transmit a data file containing the plurality of fields of demographic information to the server 115.

In some embodiments, the data file may include a table of information having any number of headings labeling a plurality of fields of demographic information. For example, as illustrated in FIG. 3, the data file may include a table having the headings “Name,” “Addrs.,” “PH#,” “FX#,” “Specialty,” “License No.,” and “Expiration Date.” However, as illustrated in FIG. 3, the demographic information provided under the heading “FX#” is a number of email addresses. Furthermore, as illustrated in FIG. 3, one of the entries under the heading “Addrs.” includes a typographical error in the zip code. As further shown in FIG. 3, the data file may include extraneous metadata and/or superfluous information. Namely, as shown in FIG. 3, the data file may include, for example, “Author Name” and “Date Generated,” indicating who authored the data file and the date it was created.

In further embodiments, the data file may include a table of information having a heading and subheadings. For example, as illustrated in FIG. 4A, the data file may have a heading labeled “Group” with subheadings labeled “Name,” “Address #1,” “Address #2,” “Phone No.,” and “Fx #.” In another example, as illustrated in FIG. 4B, the data file may have a heading labeled “Group” with subheadings labeled “Name,” “Billing,” and “Service.” In yet another example, as illustrated in FIG. 5A, the data file may have a heading labeled “Group Name” with subheadings labeled “Name,” “Addr,” “Name,” and “Addr.” Thus, as illustrated in the examples shown in FIGS. 3-5B, the data file may have inconsistent or mislabeled nomenclatures or spurious demographic information. In some instances, the format of each data file having the demographic information may be inconsistent from one source to another.

The network 110 may include one or more wired and/or wireless networks. For example, the network 110 may include a cellular network (e.g., a long-term evolution (LTE) network, a code division multiple access (CDMA) network, a 3G network, a 4G network, a 5G network, another type of next generation network, etc.), a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, a cloud computing network, and/or the like, and/or a combination of these or other types of networks.

To review and reformat the data files from the data sources 105, the server 115 may include an ingester 205, a repository 210, a display 215, and a model trainer 220, as illustrated in FIG. 2. In some embodiments, the ingester 205 may analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information. For example, in some embodiments, the model trainer 220 may train the machine learning model using a number of Monte Carlo training sets having sample data files. That is, the model trainer 220 may use a sample set generated by humans identifying demographic information in a data file. In some embodiments, the machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information. In some embodiments, the plurality of machine learning algorithms may be supervised machine learning algorithms including, but not limited to, support vector machines, linear regression, logistic regression, naive Bayes, linear discriminant analysis, decision trees, k-nearest neighbor algorithm, neural networks, and similarity learning. It should be understood by those of ordinary skill in the art that these are merely example supervised machine learning algorithms and that other supervised machine learning algorithms may be used in accordance with aspects of the present disclosure.

As one example, the ingester 205 may analyze the data file by analyzing semantic content of each of the plurality of fields of demographic information to identify the different types of demographic information. For example, the ingester 205 may identify semantic content, such as a state name or state abbreviation, which indicates that the demographic information is likely an address, rather than, for example, a phone number or facsimile number. Similarly, the ingester 205 may identify semantic content, such as street names (e.g., Avenue, Road, Street, Lane, etc.) and/or their associated abbreviations (e.g., Ave., Rd., St., Ln., etc.), which would likewise indicate that the demographic information is an address. Even further, the ingester 205 may identify semantic content, such as state names (or country names) and/or their associated abbreviations, which would likewise indicate that the demographic information is an address. In some embodiments, the ingester 205 may also be able to identify a billing address based on the semantic content. For example, the semantic content may include a PO Box number, which would indicate that the content is a billing address, rather than a service address. In yet another example, the ingester 205 may identify semantic content, such as a hyperlink, which may indicate that the demographic information is an email address. It should be understood by those of ordinary skill in the arts that these are merely examples of semantic content that may be identified, and that other types of semantic content are contemplated in accordance with aspects of the present disclosure.

As another example, the ingester 205 may analyze the data file by analyzing a shape of each of the plurality of fields of demographic information to identify the different types of demographic information. For example, the ingester 205 may analyze the demographic information to identify the number of characters, the type of the characters (e.g., numeric versus letter characters), the number of non-alphanumeric characters (e.g., spaces, commas, periods, or the like), and an overall arrangement of the alphanumeric characters and non-alphanumeric characters. For example, the shape of the demographic information may be “XXX[comma][space]XXX” or “XXX[comma][space]XXX[space]X[period]”, with each X representing a letter character, which are common formats identifying names. In another example, the shape of the demographic information may be ###[space]XXX[space]XXX[space]XXX[comma]XX[space]##### (or #####-####), with each # representing a numeric character and each X representing a letter character, which is a common format of an address. However, some data files may use a full state name, rather than the two letter abbreviation for the state, and as such, the ingester 205 may identify the state within an address based on the semantic content, as discussed herein. In yet another example, the ingester 205 may identify the shape of the demographic information, such as XXX@XXX[period]XXXX, which indicates that the demographic information is an email address. It should be understood by those of ordinary skill in the arts that these are merely examples of shapes of demographic content that may be identified, and that other types of shapes of demographic content are contemplated in accordance with aspects of the present disclosure.
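As a non-limiting illustration of the shape analysis described above, the following Python sketch reduces a value to a shape string in which each digit becomes “#” and each letter becomes “X”. The function names shape_of and collapsed_shape are hypothetical and do not appear in the disclosure; collapsing runs of letters is an assumption made here so that word length does not affect the comparison.

```python
import re


def shape_of(value: str) -> str:
    """Replace each digit with '#' and each letter with 'X'; keep other characters."""
    chars = []
    for ch in value:
        if ch.isdigit():
            chars.append("#")
        elif ch.isalpha():
            chars.append("X")
        else:
            chars.append(ch)
    return "".join(chars)


def collapsed_shape(value: str) -> str:
    """Collapse runs of letters so that word length does not affect the shape."""
    return re.sub(r"X+", "X", shape_of(value))
```

Under these assumptions, a value such as “Doe, Jane” collapses to “X, X” and an email address collapses to “X@X.X”, which a downstream classifier may use as one feature among several.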

As yet another example, the ingester 205 may analyze the data file by analyzing metadata of each of the plurality of fields of demographic information to identify the different types of demographic information. For example, the metadata may include each nomenclature of the headings. In some instances, the semantic content and shapes of the demographic information may be similar. For example, phone numbers and facsimile numbers may have similar semantic content and shapes. In another example, service addresses and billing addresses may have similar semantic content and shapes. To differentiate between demographic information having similar semantic content and shapes, the ingester 205 may analyze the metadata of the headings (or subheadings). For example, the ingester 205 may identify common nomenclatures used for the different types of demographic information. For example, common nomenclatures for phone numbers may include, but are not limited to, “Phone No.,” “Phone Number,” “P:,” “PH No.,” or the like, whereas common nomenclatures for facsimile numbers may include, but are not limited to, “Fax No.,” “Fax Number,” “F:,” “FX No.,” or the like. Likewise, common nomenclatures for service addresses may include the terms, for example, “Service,” “Serv.,” or the like, or the service address may be listed only as “Address” or some variation thereof, whereas the billing address may be specifically identified as such. Furthermore, the ingester 205 may analyze layered headings, as illustrated in the examples shown in FIGS. 3 and 4A-B. Using the data file shown in FIG. 3, the ingester 205 may analyze the headings “Author Name” and “Date Generated,” and determine that these fields are merely extraneous metadata and/or superfluous information that should be removed when reformatting the data file. As another example, using the data file shown in FIG. 4A, the ingester 205 may analyze the primary heading and subheadings, and determine that the demographic information provided below the primary heading is related to a practice group, i.e., a group name, group service address, group billing address, group phone number, and group facsimile number. In yet another example, using the data file shown in FIG. 4B, the ingester 205 may analyze the primary heading and subheadings, and determine that the demographic information provided below the primary heading is related to a practice group, i.e., a group name; however, the remaining subheadings are “Service” and “Billing,” and the ingester 205 may determine that the demographic information provided under these subheadings are a billing address, billing phone number, service address, and service phone, respectively.
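By way of example only, the heading analysis described above may be sketched in Python as a lookup of normalized headings against tables of common nomenclatures. The alias table HEADER_ALIASES, the label names, and the function classify_header are hypothetical; the aliases are drawn from the examples given in this description and are not exhaustive.

```python
import re
from typing import Optional

# Hypothetical alias table built from the nomenclatures named in the description.
HEADER_ALIASES = {
    "phone": {"phone no", "phone number", "p", "ph no", "ph"},
    "fax": {"fax no", "fax number", "f", "fx no", "fx"},
    "extraneous": {"author name", "date generated"},
}


def normalize_header(heading: str) -> str:
    """Lowercase the heading and replace punctuation such as '.', ':' and '#' with spaces."""
    return re.sub(r"[^a-z0-9]+", " ", heading.lower()).strip()


def classify_header(heading: str) -> Optional[str]:
    """Return the label whose alias set contains the normalized heading, else None."""
    key = normalize_header(heading)
    for label, aliases in HEADER_ALIASES.items():
        if key in aliases:
            return label
    return None
```

A heading that matches no alias returns None, in which case the type of the field would be determined from the semantic content and shape of the underlying values.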

In some embodiments, the machine learning model may also be trained on respective rules for common types of demographic information. For example, the rules may include a rule that a five digit number or a five digit number followed by a hyphen and another four digit number is a zip code, as these are the only available formats for zip codes. As another example, an NPI may be formatted as a ten digit number with the first digit being a “1,” and as such, the rules may include a rule indicating that any ten digit number commencing with a “1” is an NPI. In a further example, the rules may include a rule for determining responses to binary pieces of demographic information, e.g., whether a healthcare provider is accepting new patients (“Yes”/“Y” or “No”/“N”). By using rules for common types of demographic information, the present disclosure avoids overtraining, and thus overcomplicating, the machine learning model and also improves efficiency of the machine learning model. In some embodiments, these rules may be defined as regular expressions; however, it should be understood by those of ordinary skill in the arts that other types of rules may be used.
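For illustration, the three rules described above may be expressed as regular expressions along the following lines. This is a minimal sketch; the rule labels, the RULES list, and the function apply_rules are hypothetical names, and an actual embodiment may define its rules differently.

```python
import re
from typing import Optional

# Each rule pairs a hypothetical label with a pattern that must match the whole value.
RULES = [
    ("zip_code", re.compile(r"\d{5}(-\d{4})?")),    # 12345 or 12345-6789
    ("npi", re.compile(r"1\d{9}")),                 # ten digits commencing with "1"
    ("yes_no", re.compile(r"(?i)(yes|y|no|n)")),    # binary response, case-insensitive
]


def apply_rules(value: str) -> Optional[str]:
    """Return the label of the first rule that matches the entire value, else None."""
    text = value.strip()
    for label, pattern in RULES:
        if pattern.fullmatch(text):
            return label
    return None
```

Values that match no rule return None and may then be passed to the machine learning model, which is one way the rules can keep simple, well-defined formats out of the model's training burden.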

In some embodiments, the ingester 205 may analyze the inter-columnar relationship between multiple columns. For example, as illustrated in FIG. 5A, the data file includes alternating headings of “Name” and “Addr.” After reviewing the semantic content, shape, and metadata of the rows under each column, the ingester 205 may determine that the respective types of demographic information are names and addresses. Furthermore, by analyzing the inter-columnar relationship between multiple columns, the ingester 205 may determine that the alternating headings should be grouped as pairs, e.g., a healthcare provider name and their associated address. As another example illustrated in FIG. 5B, the data file may include multiple addresses for a single healthcare provider, i.e., “Addrs. 1,” “City 1,” “State 1,” as well as “Addrs. 2,” “City 2,” “State 2.” In this instance, the ingester 205 may determine that each address is associated with the same healthcare provider, and separate each address into separate entries, e.g., separate rows of information, in a revised data file, while still associating the addresses with the same healthcare provider.
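As one hedged illustration of the FIG. 5B example, the following Python sketch separates a single row having numbered address columns into one row per address, while repeating the columns shared by all addresses. The function name split_numbered_groups is hypothetical, and the sketch assumes that related columns share a trailing number in their headings, which may not hold for every data file.

```python
import re
from collections import defaultdict
from typing import Dict, List


def split_numbered_groups(row: Dict[str, str]) -> List[Dict[str, str]]:
    """Split columns such as 'City 1' / 'City 2' into one output row per number."""
    shared: Dict[str, str] = {}
    groups: Dict[int, Dict[str, str]] = defaultdict(dict)
    for heading, value in row.items():
        match = re.fullmatch(r"(.*?)\s*(\d+)", heading)
        if match:
            # Group by the trailing number and drop it from the heading.
            groups[int(match.group(2))][match.group(1)] = value
        else:
            shared[heading] = value
    # Each output row repeats the shared columns (e.g., the provider name).
    return [{**shared, **groups[n]} for n in sorted(groups)] or [shared]
```

A row without any numbered headings is returned unchanged as a single entry.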

The ingester 205 may also generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly. For example, the ingester 205 may generate a baseline score for each of the plurality of fields of demographic information, which may then be adjusted. For example, the ingester 205 may increase the scores for demographic information having well-known semantic content and/or shapes, e.g., zip codes and NPIs. Additionally, the ingester 205 may increase or decrease the score based on whether the heading correctly identifies the associated demographic information, e.g., whether the heading correctly identifies “NPIs.” For example, the score may be decreased when the heading and the content do not match, whereas the score may be increased when the heading and content match. In some embodiments, the ingester 205 may increase the score based on whether demographic information having similar semantic content and/or shapes has been detected. For example, the ingester 205 increases the score for a telephone number or address if only a single piece of demographic information having the given semantic content and/or shape is identified. However, in the event two or more identified fields of demographic information having the same semantic content and/or shape are identified (e.g., a phone number and a facsimile number or a service address and a billing address), the ingester 205 may decrease the score for both of the two or more identified fields of demographic information, and these identified fields may have the same score. Furthermore, in some situations, the ingester 205 may generate an alert notifying an administrator of the two or more identified fields of demographic information having the same semantic content and/or shape, such that the administrator may provide input to resolve the conflict.
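The score adjustments described above may be sketched, purely for illustration, as follows. The function name adjust_score and every numeric increment are assumptions chosen for this example; the disclosure does not specify particular values, and an embodiment may derive them from the machine learning model instead.

```python
def adjust_score(
    baseline: float,
    well_known_shape: bool,
    heading_matches: bool,
    similar_field_count: int,
) -> float:
    """Adjust a baseline probability using illustrative, hypothetical increments."""
    score = baseline
    if well_known_shape:
        score += 0.2  # e.g., zip codes and NPIs
    # Heading agreement raises the score; disagreement lowers it.
    score += 0.1 if heading_matches else -0.1
    if similar_field_count == 1:
        score += 0.1  # the only field with this semantic content and/or shape
    elif similar_field_count >= 2:
        score -= 0.2  # ambiguous, e.g., a phone number versus a facsimile number
    # Keep the result within the range of a probability.
    return min(1.0, max(0.0, score))
```

Because two ambiguous fields with the same inputs receive the same adjustment, they end with the same score, which is consistent with the behavior described for conflicting fields.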

To resolve such a conflict, the ingester 205 may apply additional processing to distinguish between the two or more identified fields of demographic information. For example, in some embodiments, the ingester 205 may cross-check at least one of the plurality of fields of demographic information against known demographic information stored in, for example, the repository 210. For example, the ingester 205 may cross-check an identified phone number and an identified facsimile number against known phone numbers and facsimile numbers to verify which is the phone number and which is the facsimile number. In some embodiments, the ingester 205 may sequentially check the digits of the phone and facsimile numbers until the ingester 205 determines that one of the two is a phone number. In some instances, only one of the two identified fields of demographic information may be known, e.g., the phone number, and the ingester 205 may identify one of the two or more identified fields of demographic information accordingly, with the remaining field of demographic information being identified as the most reasonable alternative (e.g., the facsimile number). Similarly, the ingester 205 may cross-check other pieces of demographic information, such as the NPI, service addresses, and billing addresses. It should be understood by those of ordinary skill in the arts that these are merely examples of the types of demographic information that may be cross-checked, and that other types of demographic information may be cross-checked in accordance with aspects of the present disclosure.

Additionally, the ingester 205 may identify incorrect information and, in some instances, update the incorrect information. For example, as illustrated in FIG. 3, the zip code in the address associated with “Jane Doe” included a typographical error, and to fix this error, the ingester 205 may query the repository 210 to identify a correct zip code. Additionally, or alternatively, the ingester 205 may compare the incorrect zip code to other zip codes of the data file, e.g., the zip code associated with “John Doe,” as illustrated in FIG. 3. As the addresses of “Jane Doe” and “John Doe” have the same street address, city, and state, the ingester 205 may determine the zip code associated with “John Doe” is the correct zip code and update the zip code for “Jane Doe” accordingly. Additionally, the ingester 205 may determine whether identified information is correct by cross-checking, for example, identified phone numbers against known phone numbers. In some instances, the cross-checking may confirm that the identified numbers are indeed phone numbers. In other instances, the cross-checking may determine that the identified phone numbers were incorrectly labeled in the data file, and in fact, are facsimile numbers, rather than phone numbers.
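As a hedged sketch of comparing an incorrect zip code to other zip codes of the same data file, the following Python example copies a well-formed zip code from another row having the same street address, city, and state. The function names, the dictionary keys, and the assumption that the typographical error produces a malformed zip code (for example, a missing digit) are all hypothetical; an error that yields a well-formed but wrong zip code would instead require a lookup against the repository 210.

```python
import re
from typing import Dict, List, Tuple

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def address_key(row: Dict[str, str]) -> Tuple[str, str, str]:
    """Case-insensitive key made of street address, city, and state."""
    return (row["street"].lower(), row["city"].lower(), row["state"].lower())


def repair_zip_codes(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Replace malformed zip codes with a well-formed one from a matching address."""
    known: Dict[Tuple[str, str, str], str] = {}
    for row in rows:
        if ZIP_RE.fullmatch(row["zip"]):
            known.setdefault(address_key(row), row["zip"])
    repaired = []
    for row in rows:
        if ZIP_RE.fullmatch(row["zip"]):
            repaired.append(dict(row))
        else:
            # Leave the value unchanged when no matching address is available.
            repaired.append({**row, "zip": known.get(address_key(row), row["zip"])})
    return repaired
```

The input rows are not modified; a new list is returned so that the original data file remains available for review.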

In some embodiments, the ingester 205 may analyze a limited number of rows of demographic information in the data file (i.e., less than the full number of rows in the data file) to improve the overall efficiency of the ingester 205. For example, after analyzing the semantic content, shape, and metadata of a number of rows, the ingester 205 may be able to identify the type of demographic information of each of the plurality of fields of demographic information, and assume that all remaining rows that have not been analyzed are the identified type of demographic information. Furthermore, the ingester 205 may generate the revised data file in smaller segments of rows, rather than the entire data file, which may require substantial amounts of resources, e.g., processing consumption and memory resources. By assuming the type of demographic information of the remaining rows, the ingester 205 reduces the overall amount of resources used and improves the efficiency of the server 115.
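The sampling approach described above may be illustrated by the following Python sketch, which classifies only the first rows of a column and assumes the remaining rows share the most frequent label. The function name infer_column_type, the default sample size of 50, and the use of a majority vote are assumptions made for this example; the classify argument stands in for whatever per-value classifier an embodiment uses.

```python
from collections import Counter
from itertools import islice
from typing import Callable, Iterable, Optional


def infer_column_type(
    values: Iterable[str],
    classify: Callable[[str], str],
    sample_size: int = 50,
) -> Optional[str]:
    """Classify at most sample_size values and return the most frequent label."""
    votes = Counter(classify(value) for value in islice(values, sample_size))
    if not votes:
        return None  # empty column
    label, _count = votes.most_common(1)[0]
    return label
```

Because islice stops after sample_size values, the cost of identifying a column's type does not grow with the number of rows in the data file.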

Once the plurality of fields of demographic information have been identified and corrected as needed, the ingester 205 may generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type. In some embodiments, the ingester 205 may generate a revised data file having a format that is customized according to a request from the data source 105. For example, the requested format may be a format that is consistent with preexisting data files of the data source 105. As another example, the requested format may be an entirely new format. For example, as illustrated in FIG. 6, the data source 105 may request that the demographic information be separated into “F Name,” “L Name,” “Street Address,” “City,” “State,” and “Zip Code.” To achieve this, the ingester 205 may identify fields for the requested format and parse through the identified types of demographic information to determine which demographic information belongs in which field of the requested format. That is, for example, when the ingester 205 identified the demographic information as being “Last Name, First Name” or “Full Name,” the ingester 205 may parse the demographic information and separate it into different fields in the revised data file, i.e., “First Name” and “Last Name.” In other words, the ingester 205 may generate new columns by parsing a column of a single type of demographic information (e.g., “Full Name”) into its subcomponents (e.g., “First Name” and “Last Name” as separate columns). Likewise, the ingester 205 may generate a new column by combining separate columns of information (e.g., “First Name” and “Last Name”) into a single column (e.g., “Full Name”). It should be understood by those of ordinary skill in the art that this is merely an example, and that the ingester 205 may parse other types of demographic information in accordance with aspects of the present disclosure. In further embodiments, the ingester 205 may separate a single incoming data file into any number of revised data files.
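The separation and combination of name columns described above may be illustrated with the following sketch. The helper names are hypothetical, and the splitting rule (a comma indicates “Last Name, First Name”; otherwise the final word is treated as the last name) is a simplifying assumption rather than the parsing actually performed by the ingester 205.

```python
def split_full_name(value):
    """Split 'Last, First' or 'First Last' into a (first, last) pair."""
    value = value.strip()
    if "," in value:
        last, first = (part.strip() for part in value.split(",", 1))
        return first, last
    first, _, last = value.rpartition(" ")
    if not first:  # a single word: treat it as the first name
        return last, ""
    return first.strip(), last


def combine_name(first, last):
    """Combine separate first and last name columns into a full name."""
    return f"{first} {last}".strip()


def to_requested_format(row):
    """Map a row with a 'Full Name' column onto the name columns of FIG. 6."""
    first, last = split_full_name(row["Full Name"])
    revised = {k: v for k, v in row.items() if k != "Full Name"}
    revised["F Name"] = first
    revised["L Name"] = last
    return revised
```

Names with suffixes, multiple surnames, or titles would need additional rules that this sketch does not attempt.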

In some instances, a given piece of demographic information may not match what the ingester 205 identified as the type of demographic information. For example, the ingester 205 may identify one of the plurality of fields of demographic information as being NPIs, but one entry may not match the known format for an NPI. In such circumstances, the ingester 205 may pass through the mismatching demographic information untouched, render the value null, or insert special characters flagging the particular entry. Alternatively, the ingester 205 may generate an alert notifying an administrator of the mismatching demographic information, such that the administrator may provide input to resolve the discrepancy.
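The three handling options for a mismatching entry may be sketched as follows for a column identified as NPIs. The sketch assumes a simple format check (exactly ten digits) and hypothetical policy names and flag characters; none of these are specified by the disclosure. The returned list of mismatching positions could be used to build an alert for an administrator.

```python
import re

# Assumed format check for illustration: an NPI is written as ten digits.
NPI_PATTERN = re.compile(r"[0-9]{10}")


def clean_npi_column(values, policy="pass_through"):
    """Apply a mismatch policy to a column identified as NPIs.

    policy is one of:
      "pass_through": keep the mismatching value untouched
      "null":         replace the mismatching value with None
      "flag":         wrap the mismatching value in marker characters
    Returns the cleaned values and the indexes of mismatching entries.
    """
    cleaned, mismatches = [], []
    for index, value in enumerate(values):
        if NPI_PATTERN.fullmatch(value):
            cleaned.append(value)
            continue
        mismatches.append(index)
        if policy == "pass_through":
            cleaned.append(value)
        elif policy == "null":
            cleaned.append(None)
        elif policy == "flag":
            cleaned.append(f"??{value}??")
        else:
            raise ValueError(f"unknown policy: {policy}")
    return cleaned, mismatches
```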

In some embodiments, the ingester 205 may determine additional information based on the identified demographic information. For example, using the identified address, the ingester 205 may determine the geolocation or coordinates of the healthcare provider. As another example, the ingester 205 may supplement a missing zip code based on a known street address, city, and state. The ingester 205 may include such additional information in the revised data file upon request. The ingester 205 may store the revised data file in the repository 210, and the server 115 may transmit the revised data file to the data source 105 over the network 110.

FIG. 7 illustrates a method for identifying demographic information in a data file.

At 705, a computing device, e.g., server 115, may receive the data file containing a plurality of fields of demographic information from a third-party. The data file may have inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information.

At 710, the computing device, e.g., server 115, may analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information. The machine learning model may be based on a plurality of machine learning algorithms to identify different types of demographic information.

At 715, the computing device, e.g., server 115, may generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly.

At 720, the computing device, e.g., server 115, may generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
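The overall flow of steps 705 through 720 may be illustrated with the following toy sketch. It is not the machine learning model of the disclosure: simple pattern detectors stand in for the trained model, and the fraction of values matching a detector is used only as a crude proxy for the score of step 715. The detector patterns and function names are assumptions made for illustration.

```python
import re

# Hypothetical detectors standing in for the trained model (illustration only).
DETECTORS = {
    "Zip Code": re.compile(r"[0-9]{5}(-[0-9]{4})?"),
    "Phone": re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}"),
    "NPI": re.compile(r"[0-9]{10}"),
}


def identify_fields(columns):
    """Steps 710 and 715: label each column and attach a score.

    columns maps the incoming header to its list of string values.
    Returns {header: (label, score)}, where score is the share of values
    matching the winning detector.
    """
    identified = {}
    for header, values in columns.items():
        best_label, best_score = "Unknown", 0.0
        for label, pattern in DETECTORS.items():
            hits = sum(1 for v in values if pattern.fullmatch(v.strip()))
            score = hits / len(values) if values else 0.0
            if score > best_score:
                best_label, best_score = label, score
        identified[header] = (best_label, best_score)
    return identified


def generate_revised(columns, identified):
    """Step 720: relabel each column with its identified type.

    Columns that could not be identified keep their original header.
    """
    revised = []
    for header, values in columns.items():
        label = identified[header][0]
        revised.append((label if label != "Unknown" else header, values))
    return revised
```

The revised file is returned as a list of (label, values) pairs rather than a dictionary so that two columns identified as the same type do not overwrite one another.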

Each of the servers and modules described above can be implemented in software, firmware, or hardware on a computing device. A computing device can include, but is not limited to, a personal computer, a mobile device such as a mobile phone, workstation, embedded system, game console, television, set-top box, or any other computing device. Further, a computing device can include, but is not limited to, a device having a processor and memory, including a non-transitory memory, for executing and storing instructions. The memory may tangibly embody the data and program instructions in a non-transitory manner. Software may include one or more applications and an operating system. Hardware can include, but is not limited to, a processor, a memory, and a graphical user interface display. The computing device may also have multiple processors and multiple shared or separate memory components. For example, the computing device may be a part of or the entirety of a clustered or distributed computing environment or server farm.

Various embodiments may be implemented, for example, using one or more well-known computer systems, such as computer system 800 shown in FIG. 8. One or more computer systems 800 may be used, for example, to implement any of the embodiments discussed herein, as well as combinations and sub-combinations thereof.

Computer system 800 may include one or more processors (also called central processing units, or CPUs), such as a processor 804. Processor 804 may be connected to a communication infrastructure or bus 806.

Computer system 800 may also include user input/output device(s) 803, such as monitors, keyboards, pointing devices, etc., which may communicate with communication infrastructure 806 through user input/output interface(s) 802.

One or more of processors 804 may be a graphics processing unit (GPU). In an embodiment, a GPU may be a processor that is a specialized electronic circuit designed to process mathematically intensive applications. The GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.

Computer system 800 may also include a main or primary memory 808, such as random access memory (RAM). Main memory 808 may include one or more levels of cache. Main memory 808 may have stored therein control logic (i.e., computer software) and/or data.

Computer system 800 may also include one or more secondary storage devices or memory 810. Secondary memory 810 may include, for example, a hard disk drive 812 and/or a removable storage device or drive 814. Removable storage drive 814 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, tape backup device, and/or any other storage device/drive.

Removable storage drive 814 may interact with a removable storage unit 818. Removable storage unit 818 may include a computer usable or readable storage device having stored thereon computer software (control logic) and/or data. Removable storage unit 818 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device. Removable storage drive 814 may read from and/or write to removable storage unit 818.

Secondary memory 810 may include other means, devices, components, instrumentalities or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 800. Such means, devices, components, instrumentalities or other approaches may include, for example, a removable storage unit 822 and an interface 820. Examples of the removable storage unit 822 and the interface 820 may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.

Computer system 800 may further include a communication or network interface 824. Communication interface 824 may enable computer system 800 to communicate and interact with any combination of external devices, external networks, external entities, etc. (individually and collectively referenced by reference number 828). For example, communication interface 824 may allow computer system 800 to communicate with external or remote devices 828 over communications path 826, which may be wired and/or wireless (or a combination thereof), and which may include any combination of LANs, WANs, the Internet, etc. Control logic and/or data may be transmitted to and from computer system 800 via communications path 826.

Computer system 800 may also be any of a personal digital assistant (PDA), desktop workstation, laptop or notebook computer, netbook, tablet, smart phone, smart watch or other wearable, appliance, part of the Internet-of-Things, and/or embedded system, to name a few non-limiting examples, or any combination thereof.

Computer system 800 may be a client or server, accessing or hosting any applications and/or data through any delivery paradigm, including but not limited to remote or distributed cloud computing solutions; local or on-premises software (“on-premise” cloud-based solutions); “as a service” models (e.g., content as a service (CaaS), digital content as a service (DCaaS), software as a service (SaaS), managed software as a service (MSaaS), platform as a service (PaaS), desktop as a service (DaaS), framework as a service (FaaS), backend as a service (BaaS), mobile backend as a service (MBaaS), infrastructure as a service (IaaS), etc.); and/or a hybrid model including any combination of the foregoing examples or other services or delivery paradigms.

Any applicable data structures, file formats, and schemas in computer system 800 may be derived from standards including but not limited to JavaScript Object Notation (JSON), Extensible Markup Language (XML), Yet Another Markup Language (YAML), Extensible Hypertext Markup Language (XHTML), Wireless Markup Language (WML), MessagePack, XML User Interface Language (XUL), comma-separated values (CSV), or any other functionally similar representations alone or in combination. Alternatively, proprietary data structures, formats or schemas may be used, either exclusively or in combination with known or open standards.

In some embodiments, a tangible, non-transitory apparatus or article of manufacture comprising a tangible, non-transitory computer useable or readable medium having control logic (software) stored thereon may also be referred to herein as a computer program product or program storage device. This includes, but is not limited to, computer system 800, main memory 808, secondary memory 810, and removable storage units 818 and 822, as well as tangible articles of manufacture embodying any combination of the foregoing. Such control logic, when executed by one or more data processing devices (such as computer system 800), may cause such data processing devices to operate as described herein.

Based on the teachings contained in this disclosure, it will be apparent to persons skilled in the relevant art(s) how to make and use embodiments of this disclosure using data processing devices, computer systems and/or computer architectures other than that shown in FIG. 8. In particular, embodiments can operate with software, hardware, and/or operating system embodiments other than those described herein.

Conclusion

The present invention has been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.

The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.

The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

What is claimed is:
 1. A computer-implemented method of identifying demographic information in a data file, comprising: receiving the data file containing a plurality of fields of demographic information from a third-party, the data file having inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information; analyzing the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information, the machine learning model being based on a plurality of machine learning algorithms to identify different types of demographic information; generating a score indicating a probability that each of the plurality of fields of demographic information was identified correctly; and generating a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
 2. The method of claim 1, wherein analyzing the data file comprises analyzing semantic content of each of the plurality of fields of demographic information to identify the different types of demographic information.
 3. The method of claim 1, wherein analyzing the data file comprises analyzing a shape of each of the plurality of fields of demographic information to identify the different types of demographic information.
 4. The method of claim 1, wherein analyzing the data file comprises analyzing metadata of each of the plurality of fields of demographic information to identify the different types of demographic information.
 5. The method of claim 4, wherein the metadata includes each nomenclature of each of the plurality of fields of demographic information.
 6. The method of claim 1, wherein, in response to identifying different ones of the plurality of fields of demographic information, the method further comprises cross-checking at least one of the plurality of fields of demographic information against known demographic information.
 7. The method of claim 1, further comprising transmitting the revised data file to the third-party.
 8. A system for identifying demographic information in a data file, comprising: a memory that stores instructions for identifying the demographic information in the data file; and a processor configured to execute the instructions that cause the processor to: receive the data file containing a plurality of fields of demographic information from a third-party, the data file having inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information; analyze the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information, the machine learning model being based on a plurality of machine learning algorithms to identify different types of demographic information; generate a score indicating a probability that each of the plurality of fields of demographic information was identified correctly; and generate a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
 9. The system of claim 8, wherein analyzing the data file comprises analyzing semantic content of each of the plurality of fields of demographic information to identify the different types of demographic information.
 10. The system of claim 8, wherein analyzing the data file comprises analyzing a shape of each of the plurality of fields of demographic information to identify the different types of demographic information.
 11. The system of claim 10, wherein the metadata includes each nomenclature of each of the plurality of fields of demographic information.
 12. The system of claim 8, wherein analyzing the data file comprises analyzing each nomenclature to identify the different types of demographic information.
 13. The system of claim 8, wherein, in response to identifying different ones of the plurality of fields of demographic information, the instructions further cause the processor to cross-check at least one of the plurality of fields of demographic information against known demographic information.
 14. The system of claim 8, wherein the instructions further cause the processor to transmit the revised data file to the third-party.
 15. A non-transitory program storage device having instructions stored thereon that, when executed by at least one computing device, cause the at least one computing device to perform a method, the method comprising: receiving the data file containing a plurality of fields of demographic information from a third-party, the data file having inconsistent or mislabeled nomenclatures for one or more fields of the plurality of fields or spurious demographic information; analyzing the data file using a machine learning model trained according to other data files to distinguish between each of the plurality of fields of demographic information, the machine learning model being based on a plurality of machine learning algorithms to identify different types of demographic information; generating a score indicating a probability that each of the plurality of fields of demographic information was identified correctly; and generating a revised data file labeling each of the plurality of fields of demographic information based on the identified type.
 16. The method of claim 15, wherein analyzing the data file comprises analyzing semantic content of each of the plurality of fields of demographic information to identify the different types of demographic information.
 17. The method of claim 15, wherein analyzing the data file comprises analyzing a shape of each of the plurality of fields of demographic information to identify the different types of demographic information.
 18. The method of claim 15, wherein analyzing the data file comprises analyzing metadata of each of the plurality of fields of demographic information to identify the different types of demographic information.
 19. The method of claim 18, wherein the metadata includes each nomenclature of each of the plurality of fields of demographic information.
 20. The method of claim 15, wherein, in response to identifying different ones of the plurality of fields of demographic information, the method further comprises cross-checking at least one of the plurality of fields of demographic information against known demographic information.